recordExtractor

The recordExtractor action is a function that takes an object as an input parameter and returns a list of records.

JavaScript

recordExtractor: ({ url, $, contentLength, fileType }) => {
  // Each object in the returned array becomes one record.
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ...any other attributes you want
    },
  ];
  // Returning [] instead skips the page.
}

Parameters

The recordExtractor receives a single object. Destructure whichever of the following properties you need to build your records.

$

object

A Cheerio instance with the HTML of the crawled page. For more information, see Extracting data with Cheerio.

contentLength

number

The size of the crawled page in bytes.

dataSources

object

The external data sources of the current URL. Each key of this object corresponds to an externalData object. For example:

JavaScript

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}
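As a rough sketch of how these values might be read, the function below copies the fields of one data source into the record. The ID `dataSourceId1` and the keys `data1` and `data2` are taken from the example above. The function name, the sample URL, and the hand-built input are assumptions for illustration, and Node's `URL` class stands in for the `url` parameter.

```javascript
// Hypothetical extractor: would be assigned to `recordExtractor` in a crawler action.
const extractWithDataSources = ({ url, dataSources }) => {
  // Fall back to an empty object when this URL has no entry in the data source.
  const external = (dataSources && dataSources.dataSourceId1) || {};
  return [
    {
      url: url.href,
      data1: external.data1,
      data2: external.data2,
    },
  ];
};

// Local check with hand-built input (not a real crawl).
const enrichedRecords = extractWithDataSources({
  url: new URL('https://example.com/page'),
  dataSources: { dataSourceId1: { data1: 'val1', data2: 'val2' } },
});
console.log(enrichedRecords);
```

Guarding against a missing entry keeps the extractor from throwing on URLs that the external data does not cover.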

fileType

string

The file type of the crawled page or document.

helpers

function

Helpers are functions that help extract content and generate records. This can help simplify your record extractor.

url

object

A Location object that contains the URL.
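A minimal sketch of reading this object, assuming it exposes `href` and `pathname` the way a Location object does. The function name, the `path` and `section` attribute names, and the sample URL are made up for illustration, and Node's `URL` class stands in for the real parameter.

```javascript
// Hypothetical extractor that derives attributes from the page URL.
const extractWithSection = ({ url }) => {
  // First non-empty path segment, or '' for the site root.
  const [section = ''] = url.pathname.split('/').filter(Boolean);
  return [{ url: url.href, path: url.pathname, section }];
};

// Local check with a hand-built URL (not a real crawl).
const sectionRecords = extractWithSection({
  url: new URL('https://example.com/docs/guide/intro?ref=nav'),
});
console.log(sectionRecords[0].section);
```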

Returns

The record extractor returns an array of records, each an object of attributes, or an empty array. If it returns an empty array, the page is skipped and no records are created for it.
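The sketch below illustrates both return shapes: an empty array to skip a page, and a one-record array otherwise. It assumes that `fileType` is the string `'html'` for regular web pages. That value, the function name, and the `size` attribute are assumptions for illustration, so check the values your crawler actually reports before relying on them.

```javascript
// Hypothetical extractor that only produces records for non-empty HTML pages.
const extractHtmlOnly = ({ url, contentLength, fileType }) => {
  // Assumed file type value: 'html'. Anything else, or an empty body, is skipped.
  if (fileType !== 'html' || contentLength === 0) {
    return [];
  }
  return [{ url: url.href, size: contentLength }];
};

// Local checks with hand-built input (not a real crawl).
const skipped = extractHtmlOnly({
  url: new URL('https://example.com/file.pdf'),
  contentLength: 2048,
  fileType: 'pdf',
});
const kept = extractHtmlOnly({
  url: new URL('https://example.com/page'),
  contentLength: 1024,
  fileType: 'html',
});
console.log(skipped.length, kept.length);
```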
